CUDA: Hide latency of bias and gate-loading for fused `mul_mat_vec_q` #16847

ORippler · 2025-10-29T18:13:03Z

This PR hides the latency of bias and gate loading in the fused `mul_mat_vec_q` kernel by loading both values into registers before the dot product is computed, effectively batching those loads with the dot product. Because many threads are still alive in this part of the kernel, the warp scheduler has enough threads available to hide the cost of loading those two single floats.
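The ordering described above can be sketched per thread as follows. This is a minimal illustration, not the code in `mmvq.cu`: the helper name `fused_row_sketch`, the `SKETCH_HD` macro, the use of plain `float` weights instead of quantized blocks, and the way the two values are combined in the epilogue are all assumptions made for the example.

```cpp
#include <cassert>

// Compile as plain C++ or, under nvcc, as a host/device helper.
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Hypothetical helper: one thread's dot product over n elements, followed by
// a bias add and a gate multiply for output row `row`.
SKETCH_HD inline float fused_row_sketch(
        const float * w, const float * x, int n,
        const float * bias, const float * gate, int row) {
    assert(n >= 0);

    // Issue the two scalar loads first. Nothing in the loop below depends on
    // them, so on a GPU their latency can overlap with the dot product.
    const float b = bias ? bias[row] : 0.0f;
    const float g = gate ? gate[row] : 1.0f;

    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += w[i] * x[i];
    }

    // Epilogue: b and g were requested before the loop started.
    // The combination (acc + b) * g is an assumption for illustration only.
    return (acc + b) * g;
}
```

Calling the helper with `w = {1, 2, 3, 4}`, `x = {1, 1, 1, 1}`, a bias of `0.5` and a gate of `2` returns `21`. The result is the same as loading `b` and `g` after the loop; only the point at which the loads are issued changes, which is what lets other warps cover their latency.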

This gives a 3-14% E2E speed-up for gpt-oss models (qwen3moe does not use bias and gate, and I am unaware of any other MoE model that uses bias and gate which I could run E2E perf tests on). The kernels themselves are up to 20% faster for gpt-oss.

| GPU | Model | Test | t/s master | t/s this branch (babfd19) | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 4000 SFF Ada | gpt-oss 20B MXFP4 MoE | tg128 | 85.41 | 88.20 | 1.03 |
| RTX 4000 SFF Ada | gpt-oss 20B MXFP4 MoE | tg256 | 85.66 | 88.57 | 1.03 |
| RTX 4000 SFF Ada | gpt-oss 20B MXFP4 MoE | tg512 | 84.80 | 87.62 | 1.03 |
| RTX 6000 Ada | gpt-oss 20B MXFP4 MoE | tg128 | 240.55 | 248.09 | 1.03 |
| RTX 6000 Ada | gpt-oss 20B MXFP4 MoE | tg256 | 245.03 | 253.23 | 1.03 |
| RTX 6000 Ada | gpt-oss 20B MXFP4 MoE | tg512 | 242.32 | 250.41 | 1.03 |
| RTX PRO 4500 BW | gpt-oss 20B MXFP4 MoE | tg128 | 212.97 | 223.10 | 1.05 |
| RTX PRO 4500 BW | gpt-oss 20B MXFP4 MoE | tg256 | 220.06 | 236.34 | 1.07 |
| RTX PRO 4500 BW | gpt-oss 20B MXFP4 MoE | tg512 | 218.21 | 237.13 | 1.09 |
| RTX PRO 6000 BW Max-Q | gpt-oss 120B MXFP4 MoE | tg128 | 198.64 | 208.72 | 1.05 |
| RTX PRO 6000 BW Max-Q | gpt-oss 120B MXFP4 MoE | tg256 | 222.68 | 238.11 | 1.07 |
| RTX PRO 6000 BW Max-Q | gpt-oss 120B MXFP4 MoE | tg512 | 224.01 | 240.44 | 1.07 |
| RTX PRO 6000 BW Max-Q | gpt-oss 20B MXFP4 MoE | tg128 | 296.58 | 338.69 | 1.14 |
| RTX PRO 6000 BW Max-Q | gpt-oss 20B MXFP4 MoE | tg256 | 314.90 | 356.62 | 1.13 |
| RTX PRO 6000 BW Max-Q | gpt-oss 20B MXFP4 MoE | tg512 | 329.31 | 353.40 | 1.07 |

This is realised by loading the bias and gate values into registers before computation of the dot-product, effectively batching them together with said dot-product. As a lot of threads are alive at that point, the warp scheduler has enough threads available to hide the cost of additionally loading those two floats.

TinyServal · 2025-10-29T21:32:39Z

Fixes #16815. Benchmarks for the affected devices (Ampere cards with low memory bandwidth) can be found in the comments.

Results on the RTX A4000:

| Model | Test | t/s master (`b9ce940`) | t/s #16847 (`babfd19`) | Speedup |
| --- | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | tg128 | 113.70 | 118.39 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg256 | 112.42 | 116.77 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg512 | 109.96 | 114.29 | 1.04 |

ggml/src/ggml-cuda/mmvq.cu

* CUDA: Remove unneded bias/gate dims in fused mmvq

  Pointed out [here](#16847 (comment)) that only a single value is needed per target col per thread
* Apply suggestions from code review

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Fix "Error 991-D: extra braces are nonstandard" during compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

…is resolved revert ggml-org#16715 (+2 squashed commit) Squashed commit: [289af2ee2] Revert "Hide latency of bias and gate-loading (ggml-org#16847)" This reverts commit 8b11dee. [a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (ggml-org#16807)" This reverts commit 463bbf2.

ORippler requested a review from JohannesGaessler as a code owner October 29, 2025 18:13

am17an mentioned this pull request Oct 29, 2025

CUDA Performance Regression on Jetson AGX Orin #16815

Closed

github-actions bot added the `Nvidia GPU` (issues specific to Nvidia GPUs) and `ggml` (changes relating to the ggml tensor library for machine learning) labels Oct 29, 2025

am17an approved these changes Oct 30, 2025

am17an merged commit 8b11dee into ggml-org:master Oct 30, 2025
71 of 72 checks passed

JohannesGaessler reviewed Oct 30, 2025

JohannesGaessler mentioned this pull request Oct 30, 2025

Massively Improved ROCm/HIP rocWMMA Performance (pp and tg) #16827

Closed

ikawrakow mentioned this pull request Oct 30, 2025

Biased mmvq: minor optimization ikawrakow/ik_llama.cpp#880

Merged

ORippler added a commit to ORippler/llama.cpp that referenced this pull request Oct 30, 2025

CUDA: Remove unneded bias/gate dims in fused mmvq

44987f7

Pointed out [here](ggml-org#16847 (comment)) that only a single value is needed per target col per thread

ORippler mentioned this pull request Oct 30, 2025

CUDA: Remove unneded bias/gate dims in fused mmvq #16858

Merged

DajanaV mentioned this pull request Oct 31, 2025

UPSTREAM PR #16858: CUDA: Remove unneded bias/gate dims in fused mmvq auroralabs-loci/llama.cpp#17

Closed
